Record Linkage Bootcamp


for Social Science Data



Eric Manning, PhD

February 27, 2026



  ericmanning / rl-bootcamp

What is this? Who are you?



Why a bootcamp (and not 🤖)


What this is not…

  • Statistics lecture
  • Failsafe cookbook

Objectives

  • “I can form an outline of the implementation steps in my head.”
  • “For each step, I know some things that might go wrong.”
  • “I can tell 🤖 what I want… and I can usually tell when it’s misguided.”


About us 👋

Bootcamp agenda




1) The framework

  • Step-by-step through the ER framework
  • Deterministic examples


2) The classic: Fellegi-Sunter

  • Basic probabilistic setup
  • Some notation, then demos


3) In practice

  • Some time with non-human data inputs
  • Your research


4) The frontier

  • AI-/LLM-enhanced linkage
  • What to do when you don’t trust your blocks (LSH)

Bootcamp setup



Recommended R packages:

install.packages(c(
  'dbplyr',     # imports a bunch of stuff
  'fastLink',   # imports stringdist
  'igraph',
  'duckdb',
  'fuzzylink',
  'zoomerjoin'
))



Python:

uv add splink   # incl. duckdb

1) The framework

Fundamental problems of record linkage: feasibility



Problem 1: The task is \(O(N_A N_B)\)


For merges, \(N_A = N_B = 10{,}000\) generates 100M comparisons across each of \(k\) linkage fields.


Deduplication of one such file: \(\binom{N}{2} \approx 50\text{M}\) comparisons.


Solution: Some observations are almost certainly not matches, so we don’t need to compare them.

Fundamental problems of record linkage: definitional



Problem 2: What is a “match?”


id  first  mi  last     housenum  street  city       zip    st
1   eric   m   manning  169       nassau  princeton  08540  nj
2   ericm      manning  169       nassau             08540
3   e          manning  169       nassau             08540
4   e          manning  160       nassau             08540
5   emily      manning  160       nassau             08540


Solution: Know your data well. Examine large clusters, esp. those with partial matches.

A basic record linkage workflow



1. Pre-process input columns

2. Define blocking rules

3. Define comparison criteria

3a. Train an unsupervised probabilistic model?

4. Match by applying comparison criteria to blocked data

5. Resolve matches into underlying clusters (people, entities, etc.)

6. Generate canonical output


Some approaches blend these steps together (say, redefine comparison criteria iteratively based on match quality).

Brief foreshadowing



Basic deterministic linkage by first + last name (comparisons) within ZIP codes + gender (blocks)…

-- cols in dfA, dfB: id, first, last, zip, gender (M/F/NULL)
-- return just the indices of the matched pairs

SELECT l.id AS id_l, r.id AS id_r
FROM dfA AS l
INNER JOIN dfB AS r
-- blocking rules:
ON l.zip = r.zip AND COALESCE(l.gender = r.gender, true)
-- comparison filters:
WHERE
  jaro_winkler_similarity(l.first, r.first) >= 0.9
  AND jaro_winkler_similarity(l.last, r.last) >= 0.9;

Step 1: Pre-processing

Pre-processing: generic strings



Typical considerations:

  • Standardizing representation of missing values ("unknown", "NA", "NULL", "")
  • Case sensitivity
  • Trimming whitespace
  • Removing punctuation and special characters (accents, etc.)
    • stringi::stri_trans_general in R.
    • But are you sure?
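As a sketch of these steps in Python (the helper name and the missing-value list are illustrative, not from the slides):

```python
import unicodedata

# values treated as missing after normalization (illustrative, extend as needed)
MISSING = {"", "na", "n/a", "null", "unknown"}

def normalize(s):
    """Uppercase, trim/collapse whitespace, strip accents and punctuation,
    and map missing-value codes to None."""
    if s is None:
        return None
    # decompose accented characters, then drop the combining marks
    s = unicodedata.normalize("NFKD", s)
    s = "".join(c for c in s if not unicodedata.combining(c))
    s = " ".join(s.upper().split())                       # case + whitespace
    s = "".join(c for c in s if c.isalnum() or c == " ")  # punctuation
    return None if s.lower() in MISSING else s

normalize("  José  Mañning ")  # 'JOSE MANNING'
normalize("N/A")               # None
```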

Pre-processing: human name strings



The goal (usually):

Input:

name
ERIC MANNING
Eric M Manning
MANNING, ERIC M MR.
eric m. manning jr

Parsed:

name_first  name_middle  name_last  name_suffix
eric                     manning
eric        m            manning
eric        m            manning
eric        m            manning    jr


  • Name parsing options in R
    • install_github('mjfii/Name-Parser')
    • install_github('Nonprofit-Open-Data-Collective/peopleparser')
    • humaniformat::
  • Many Python libraries
  • You can typically get away with doing (some of) this on your own.
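A stripped-down sketch of such a parser (the title/suffix token lists are illustrative and far from complete, and real names break simple rules constantly):

```python
SUFFIXES = {"JR", "SR", "II", "III", "IV"}    # illustrative, not exhaustive
TITLES   = {"MR", "MRS", "MS", "DR", "PROF"}

def parse_name(raw):
    """Parse 'First [M] Last [Suffix]' or 'Last, First [M] [Title]' into components."""
    tokens = [t.strip(".").upper() for t in raw.replace(",", " , ").split()]
    if "," in tokens:                          # 'MANNING, ERIC M' -> 'ERIC M MANNING'
        i = tokens.index(",")
        tokens = tokens[i + 1:] + tokens[:i]
    tokens = [t for t in tokens if t and t not in TITLES]
    suffix = tokens.pop() if tokens[-1] in SUFFIXES else ""
    first, *rest = tokens
    last = rest.pop() if rest else ""
    return {"first": first, "middle": " ".join(rest), "last": last, "suffix": suffix}

parse_name("MANNING, ERIC M MR.")
# {'first': 'ERIC', 'middle': 'M', 'last': 'MANNING', 'suffix': ''}
```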

Pre-processing: addresses



This is generally quite hard if the addresses are not already pre-processed.

Addresses as strings… in two stages:

  • Parsing into constituent components (house number, street name, …)
  • Standardization of each component (“STREET” to “ST”)
  • USA ZIP codes are easy, but don’t forget the northeast (leading zeros get dropped when ZIPs are stored as numbers)
    • Use stringr::str_pad()
    • str_pad('8540', width = 5, side = 'left', pad = '0') == '08540'

Addresses are very weird. Please do not attempt to standardize them on your own unless you are willing to dedicate a lot of time to building your own tool.

Pre-processing: tools for addresses




Or… use any (other) geocoder.

Pre-processing: PostGIS (in Postgres) standardizer



SELECT (each(hstore(p))).* FROM standardize_address(
  'tiger.pagc_lex', 'tiger.pagc_gaz', 'tiger.pagc_rules',
  'One Devonshire Place, PH 301, Boston, MA 02109'
) AS p;

    key     |      value
------------+-----------------
 box        |
 city       | BOSTON
 name       | DEVONSHIRE
 qual       |
 unit       | # PENTHOUSE 301
 extra      |
 state      | MA
 predir     |
 sufdir     |
 country    | USA
 pretype    |
 suftype    | PL
 building   |
 postcode   | 02109
 house_num  | 1
 ruralroute |

Pre-processing: geocoding addresses



Geocoding usually comes with standardization:

  • Census (USA, free); ArcGIS (USA, international; free for you)


Please don’t ever pay for an address parser or geocoder (at least for US addresses).


Geocoding converts messy address strings into structured lat/long coordinates. This means:

  • You can compare geospatial distance and (or instead of) address strings
  • Geocodes are standardized: misspellings, abbreviations, etc. all collapse to same coordinates
  • Geohashes or spatial indices (like H3) become natural blocking keys

Pre-processing: Geocoding in R using tidygeocoder



Use the Census API for bulk geocoding (10k batches):


addr = data.frame(a='169 Nassau Street, Princeton, NJ, 08540')

tidygeocoder::geocode(addr, method = 'census', address = 'a', verbose = T)


## # A tibble: 1 x 3
##   a                                         lat  long
##   <chr>                                   <dbl> <dbl>
## 1 169 Nassau Street, Princeton, NJ, 08540  40.4 -74.7


You can parallelize batches and the Census API will, generally, cooperate.

Pre-processing: Parallelized Census geocoding



geo_out = pbmcapply::pbmclapply(
  geo_input,
  FUN = \(.x) {
    geocode(
      .x,
      street = geo_in, city = city, state = state, postalcode = zip,
      method = "census", full_results = TRUE,
      api_options = list(census_return_type = 'geographies'),
      custom_query = list(vintage = 'Census2010_Current')
    )[, keep_vars]
  },
  mc.cores = parallel::detectCores() - 1, mc.preschedule = F
)

Pre-processing: Other tidygeocoder features



Cascading (useful when one service doesn’t cover all your addresses):

geocode_combine(
  data.frame(a=c('169 Nassau Street, Princeton, NJ, 08540',
                 'Anfield, Liverpool, UK')),
  queries = list(
    list(method = 'census'),
    list(method = 'osm')
  ),
  global_params = list(address = 'a'),
  cascade = T
)


## # A tibble: 2 x 4
##   a                                         lat   long query
##   <chr>                                   <dbl>  <dbl> <chr>
## 1 169 Nassau Street, Princeton, NJ, 08540  40.4 -74.7  census
## 2 Anfield, Liverpool, UK                   53.4  -2.96 osm

Pre-processing: Comparing geocoding service quality



out = lapply(c('osm', 'arcgis', 'census'),
  \(x) geocode(addr, method = x, address = 'a') %>% mutate(class = x)
) %>% bind_rows()

leaflet() %>% addTiles() %>% addMarkers(data = out, label = out$class)

Geocoding: practical considerations



  • ArcGIS will support ~1M geocodes/hour (with address standardization)

    • Consumes (library) credits unless you set up locally (see me)
    • International coverage
  • OpenStreetMap (OSM) coverage varies by country with rolling updates

  • IRB/other regulatory limitations to uploading address data

  • Local solution desirable for data ingestion pipelines (maybe)

Step 2: Blocking

Blocking



A blocking field is (usually) just a join key (i.e., exact match required).

Considerations:

  • You will almost certainly drop comparisons between matches.
    • Inherent tradeoff between tractability and false negatives.
    • Recall in real-world examples is often < 0.75.
  • Typically the cleanest/most reliable field(s) in your dataset.
  • Hopefully high cardinality (GEOIDs >> ZIP codes)
  • Blocking fields are often skewed (state, first name, last name)
  • Required specificity determined by data size, skew, cardinality of blocking vars, and compute resources.
  • Join key types typically affect performance quite a bit. Avoid strings if possible.

KNOW YOUR DATA WELL ENOUGH TO SELECT GOOD BLOCKS.

Blocking: Example of reduction in comparison space



There were 3.4M SSA births in 1980. What if we compared everyone to each other?

11.7 trillion comparisons.


What if we block on gender?

5.9 trillion comparisons.


What if we block on gender and first initial?

513 billion comparisons.


What if we block on gender and first name?

47 billion comparisons (~12min w/ DuckDB’s Jaro-Winkler)
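The arithmetic behind these reductions is just a sum of block-size products. A sketch (with a toy blocking key, not the SSA data):

```python
from collections import Counter

def n_comparisons(keys_a, keys_b):
    """Comparisons surviving a blocking key:
    sum over shared keys of |block in A| * |block in B|."""
    ca, cb = Counter(keys_a), Counter(keys_b)
    return sum(n * cb[k] for k, n in ca.items())

# toy example: block on first initial
a = ["E", "E", "M", "J"]
b = ["E", "M", "M", "K"]
n_comparisons(a, b)   # 2*1 + 1*2 = 4, vs. 4*4 = 16 unblocked
```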

Blocking: Practical considerations



What’s wrong with this?

[ ... ]
FROM dfA l
INNER JOIN dfB r
ON jaro_winkler_similarity(l.first, r.first) > 0.9


Get creative!

FROM dfA l
INNER JOIN dfB r
ON (l.zip = r.zip AND l.first = r.first)
  OR (coalesce(l.gender = r.gender, true) AND l.zip = r.zip AND l.last = r.last)
  OR (coalesce(l.gender = r.gender, true) AND l.first[1] = r.first[1] AND l.last[1] = r.last[1])
  OR (l.zip = r.zip AND l.last = r.last)
  [ ... ]

Blocking: common blocking variables and their problems

  • Higher level geography (city, state, ZIP code)
    • Cities are often misspelled or abbreviated ("New York City" vs. "NYC")
  • Gender (usually imputed from SSA records)
    • Madison between 1980 and 2000
  • Names (or their substrings – e.g., first letters)
    • Beware misspellings early in strings
    • What about nicknames? (more later)
      • (l.first IN r.first_nicknames) OR (r.first IN l.first_nicknames)
    • Beware last name changes by gender

Be sure you handle missingness as intended.

  • Joining on nulls?

Blocking: Constructing new indices



Sometimes we don’t have clean join keys anywhere.

  • Convert name strings to phonetic representations (soundex, [double-]metaphone)
    • {JOHN, JHON, KOHN} maps to {J500, J500, K500}
  • Locality-sensitive hashing (later) for strings or points
  • Cosine similarity/nearest neighbors with semantic embeddings for strings (later)
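A minimal soundex sketch (the full algorithm has extra rules; this simplified version reproduces the slide’s example):

```python
def soundex(name: str) -> str:
    """Simplified soundex: keep the first letter, encode the rest as digits,
    collapse runs, pad/truncate to 4 characters."""
    codes = {}
    for letters, d in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
        for c in letters:
            codes[c] = d
    name = name.upper()
    out, prev = name[0], codes.get(name[0], "")
    for c in name[1:]:
        if c in "HW":          # H/W are skipped and do not break a run
            continue
        d = codes.get(c, "")   # vowels get "" and DO break a run
        if d and d != prev:
            out += d
        prev = d
    return (out + "000")[:4]

[soundex(n) for n in ("JOHN", "JHON", "KOHN")]  # ['J500', 'J500', 'K500']
```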

What about geospatial data?

Blocking with H3: Hexagonal hierarchical indexing



H3 (Uber) tiles the globe into nested hexagons at 16 resolutions.

  • Each lat/long = one H3 cell index (a 64-bit integer).

  • Resolution controls block size (res 6 ≈ 36 km², res 8 ≈ 0.74 km²)

Key idea for blocking: expand each cell to a k-ring (cell + all neighbors within \(k\) steps) and compare only within overlaps.

There are APIs for basically every language.

H3 k-ring neighbor structure (h3geo.org)

H3 Blocking Example in DuckDB



INSTALL h3 FROM community; LOAD h3;

CREATE TABLE records AS FROM (VALUES
  (1, 40.7128, -74.0060),
  (2, 34.0522, -118.2437),
  (3, 40.7282, -74.0776)
) t(id, lat, lng);

ALTER TABLE records ADD COLUMN h3 UBIGINT;
UPDATE records SET h3 = h3_latlng_to_cell(lat, lng, 3); -- ~68km len

SELECT a.id AS id_a, b.id AS id_b
FROM records a JOIN records b
ON h3_grid_distance(a.h3, b.h3) <= 1 -- overlapping 1-neighbor k-rings
  AND a.id < b.id;

Step 3: Defining comparisons

Comparisons: string distance (1 - similarity) functions



Levenshtein: Number of insertions, deletions, substitutions

Damerau-Levenshtein: same, but with transpositions

Jaccard (bag of n-grams): \[1 - \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}\]

Jaro: \(\in [0,1]\), function of matching characters and their positions, transpositions

Jaro-Winkler: upweights agreement at the beginning of the string; most common for names (it was developed for them)

Phonetics (soundex, [double-]metaphone) … and many others


More recently, token sort ratio functions (e.g., via RapidFuzz)

Comparison functions for strings (similarities)



Function              (AMELIA, MALIA)   (BETH, LISBETH)   (JOHN, JHON)
Levenshtein           0.66              0.57              0.50
Damerau-Levenshtein   0.66              0.57              0.75
Jaccard               0.80              0.57              1.00
Jaro-Winkler          0.88              0                 0.93
Soundex (0 or 1)      0                 0                 1.00

Notes:

stringdist::stringsim('NEW YORK CITY', 'NYC', 'jaccard')     # 0.27
stringdist::stringsim('ELIZABETH', 'BETH', 'jw')             # 0.45
stringdist::stringsim('5th AVE', '6th AVE', method='jw')     # 0.90
## but...
stringdist::stringsim('FIFTH AVE', 'SIXTH AVE', method='jw') # 0.85
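Note the Jaccard column in the table appears to be computed on sets of single characters (q = 1), which is why (JOHN, JHON) scores 1.00. A sketch:

```python
def jaccard_sim(a: str, b: str, q: int = 1) -> float:
    """Jaccard similarity on sets of character q-grams."""
    grams = lambda s: {s[i:i + q] for i in range(len(s) - q + 1)}
    A, B = grams(a), grams(b)
    return len(A & B) / len(A | B)

jaccard_sim("JOHN", "JHON")       # 1.0 (same character set)
jaccard_sim("AMELIA", "MALIA")    # 0.8
jaccard_sim("JOHN", "JHON", q=2)  # 0.0 (no shared bigrams)
```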

Comparisons for other data types



  • Arrays: any intersection (or Jaccard)
  • Geospatial: lat/long distances (typically Euclidean approx.)
  • Dates: probably NOT int distance (unless good reason)
    • Which pair is the likeliest match?
      • (2005-03-17, 1005-03-17) OR (2005-03-17, 2023-02-12)
    • Beware of 1900-01-01 (missing!) or maybe even 1970-01-01 (UNIX epoch)
  • Age: int distance (e.g., duration between data capture +/- 1 as threshold)

How to handle addresses?

Step 4 (maybe): Model training

Training: when and why



Deterministic output: \(\mathbb{P}(M_{ij}) \in \{0,1\}\)


This assumes you know the criteria that define a match (well enough).


But what if you don’t? (Or what if you want a posterior probability?)

Training: Probabilistic model types



Basic probabilistic model types:

  • Unsupervised: Fellegi-Sunter model (later)
    • Catch: You have to define the blocking rules and the comparison levels for each field
    • Benefits: Posteriors; not writing out all comparison combinations in a big rule set
  • Supervised: Label some observation pairs, then train a predictive model (xgboost, probably)
    • Catch: You have to label data (which data??)
    • Benefit: Comparison levels and/or fields learn complex relationships
    • Example: Census Tree Project
  • Semi-supervised: human-in-the-loop active learning approaches

Step 5: Matching

Basic deterministic matching: wrong



What’s wrong with the following?


# cols in dfA, dfB: id, first, last, zip, gender

# dplyr joins NA keys to NA keys by default (na_matches = "na")
inner_join(dfA, dfB, by = c('zip', 'gender'), suffix = c('_l','_r')) |>
  filter(
    stringdist::stringsim(first_l, first_r, 'jw', p = 0.1) >= 0.9,
    stringdist::stringsim(last_l,  last_r,  'jw', p = 0.1) >= 0.9
  ) |>
  select(id_l, id_r)

# then join the NULL gender obs in dfA to all of dfB
# and vice versa

Basic deterministic matching (correctly, in DuckDB SQL)



CREATE TABLE match_inds AS
  SELECT
    -- return just the indices of the matched pairs:
    l.id AS id_l, r.id AS id_r
  FROM
    dfA AS l
  INNER JOIN
    dfB AS r
  -- blocking rules:
  ON l.zip = r.zip AND coalesce(l.gender = r.gender, true)
  -- comparison filters:
  WHERE
    jaro_winkler_similarity(l.first, r.first) >= 0.9
    AND jaro_winkler_similarity(l.last, r.last) >= 0.9
;

Query is faster and will consume less memory. Will spill to disk if necessary.

Basic deterministic matching (correctly, from R)



con = dbConnect(duckdb('path_to.db', ...))

tbl(con, 'dfA') |>
  inner_join(tbl(con, 'dfB'), by = c('zip', 'gender'), suffix = c('_l', '_r')) |>
  filter(
    jaro_winkler_similarity(first_l, first_r) >= 0.9,
    jaro_winkler_similarity(last_l,  last_r)  >= 0.9
  ) |>
  select(id_l, id_r) |>
  compute(
    name      = 'match_inds',
    temporary = FALSE
  )

Remember! What’s different here vs. base R?

Why does jaro_winkler_similarity(...) work in the call to filter()?

Step 6: Clustering

Clustering



We have a bunch of matched pairs (with posteriors). Match status might not be transitive. Now what?

Most commonly, connected components

  • Observations are nodes, matches are edges
  • Find all (connected) clusters
  • Requires discretizing match status from posteriors

Can implement other algorithms (with probabilities as IPWs) if computationally feasible (e.g., label propagation).

  • See community detection algorithms in the onager extension for DuckDB here.

Clustering example



In R, generate graph object from match pairs list and use igraph::components(graph_obj).

In DuckDB:

install onager from community;
load onager;

create table edges as
select *
from (values (1::bigint, 2::bigint),
             (2, 3),
             (3, 1),
             (4, 5),
             (5, 6)) t(src, dst);

from onager_cmm_components((select src, dst from edges))

(What’s the answer?)

See here for manual coding example in Python/DuckDB
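Outside a database, connected components over match pairs is a few lines of union-find. A sketch in Python (it also answers the question above: clusters {1, 2, 3} and {4, 5, 6}):

```python
from collections import defaultdict

def connected_components(edges):
    """Union-find over match pairs; returns a list of clusters (sets of node ids)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)          # union the two clusters

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return list(clusters.values())

connected_components([(1, 2), (2, 3), (3, 1), (4, 5), (5, 6)])
# two clusters: {1, 2, 3} and {4, 5, 6}
```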

Clustering: common issues



Household overclustering (!!)

  • What are your minimal requirements for a match on first name?

Overclustering is particularly important for descriptive relationships…

(Over-)Clustering example


Step 7: Canonicalization

Canonicalization

In some cases, you may need the “best record” for each cluster.

  • Most recent?
  • Most common?
  • One at random?
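A sketch of the “most common value” rule, applied field by field (clusters represented as lists of dicts; illustrative only):

```python
from collections import Counter

def canonical_record(cluster):
    """For each field, take the most common non-missing value across the cluster."""
    out = {}
    for field in cluster[0]:
        values = [r[field] for r in cluster if r.get(field) not in (None, "")]
        out[field] = Counter(values).most_common(1)[0][0] if values else None
    return out

cluster = [
    {"first": "eric",  "mi": "m", "zip": "08540"},
    {"first": "eric",  "mi": "",  "zip": "08540"},
    {"first": "ericm", "mi": "m", "zip": ""},
]
canonical_record(cluster)
# {'first': 'eric', 'mi': 'm', 'zip': '08540'}
```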

Some options for inference:

  • Naive bootstrap for records
  • Posteriors (in probabilistic approach) as sampling or propensity weights
  • It quickly becomes clear that connected components dispenses with a lot of potentially useful information…
  • There are principled ways of doing this

2) The Classic: Fellegi-Sunter PRL

Fellegi-Sunter probabilistic record linkage model

Fellegi and Sunter (1969, JASA) | Enamorado, Fifield, and Imai (2019, APSR)



Do you need this? What is your end goal?

Benefit (0): You don’t define all of the possible “match” criteria combinations across linkage fields or label any training data.

Benefit (1): We formalize assumptions necessary to incorporate posterior match probabilities as weights when merged data is outcome or explanatory variable.

Benefit (2): We can incorporate prior match probabilities from existing information (e.g., migration rates, name frequency weighting).

You must specify blocks, comparison criteria/levels, posterior threshold for a match (sometimes), and (optional) priors.

Several robust, efficient implementations across languages.

Fellegi-Sunter: setup

Less math-heavy tutorial from Linacre (here).



Suppose we are matching datasets \(A\) and \(B\) across \(K\) fields.

Similarity between \(i \in A, j \in B\) in field \(k\) is \(S_k(i,j)\), typically \(\in [0,1]\).

We coarsen this similarity to an agreement pattern \(\gamma_k(i,j)\) where: \[\gamma_k(i,j) = \begin{cases} 2 & S_k(i,j) \geq \tau_2 \\ 1 & \tau_1 \leq S_k(i,j) < \tau_2 \\ 0 & \text{otherwise} \end{cases}\]

This is generalizable to distance metrics, discrete metrics (e.g., Levenshtein), etc.

For \(i, j\), we have \(\boldsymbol{\gamma}(i,j) = \{\gamma_1(i,j), ..., \gamma_K(i,j)\}\), where field \(k\) has \(L_k\) agreement levels.

Fellegi-Sunter: missingness



Missingness is recorded as \(\delta_k(i,j)\) where: \[\delta_k(i,j) = \begin{cases} 1 & \text{if either } i_k \text{ or } j_k \text{ missing} \\ 0 & \text{otherwise} \end{cases}\]

Let \(m\in \{0,1\}\) be the (unknown) match status for \(i,j\), which is distributed

\[ M(i,j) \stackrel{iid}{\sim} \text{Bernoulli}(\lambda) \]

Assume that the probability of each agreement level is distributed

\[\gamma_k(i,j) | M(i,j) = m \stackrel{indep}{\sim} \text{Discrete}(\boldsymbol{\pi}_{km})\]

Fellegi-Sunter: math (likelihood function)



The parameters \(\boldsymbol{\gamma}, \boldsymbol{\delta}\) are given by the observed data. The observed data likelihood, then, is: \(l_{\text{obsv}}(\lambda, \boldsymbol{\pi} | \boldsymbol{\delta}, \boldsymbol{\gamma}) \propto\)

\[\prod_{i=1}^{N_A} \prod_{j=1}^{N_B} \left\{ \sum_{m=0}^{1} \lambda^m (1-\lambda)^{1-m} \prod_{k = 1}^{K} \left( \prod_{l=1}^{L_k-1} \pi_{kml}^{\boldsymbol{1}\{\gamma_k(i,j) = l\}}\right)^{1-\delta_{k}(i,j)}\right\}\]

The probability of \((i,j)\) being a match is:

\[\begin{equation*} \begin{split} \xi_{i,j} & = \mathbb{P}(M_{ij} = 1 | \boldsymbol{\delta}(ij),\boldsymbol{\gamma}(i,j)) \\ &= \frac{\lambda \prod_{k = 1}^{K} \left(\prod_{l=1}^{L_k-1} \pi_{k1l}^{\boldsymbol{1}\{\gamma_k(i,j) = l\}} \right)^{1-\delta_{k}(i,j)}}{\sum_{m=0}^{1}\left\{\lambda^m(1-\lambda)^{1-m} \prod_{k = 1}^{K} \left(\prod_{l=1}^{L_k-1} \pi_{kml}^{\boldsymbol{1}\{\gamma_k(i,j) = l\}} \right)^{1-\delta_{k}(i,j)}\right\}} \end{split} \end{equation*}\]

Fellegi-Sunter: more math (the E-M part)



\[\begin{equation*} \begin{split} \xi_{i,j} & = \text{Pr}(M_{ij} = 1 | \boldsymbol{\delta}(ij),\boldsymbol{\gamma}(i,j)) \\ &= \frac{\lambda \prod_{k = 1}^{K} \left(\prod_{l=1}^{L_k-1} \pi_{k1l}^{\boldsymbol{1}\{\gamma_k(i,j) = l\}} \right)^{1-\delta_{k}(i,j)}}{\sum_{m=0}^{1}\left\{\lambda^m(1-\lambda)^{1-m} \prod_{k = 1}^{K} \left(\prod_{l=1}^{L_k-1} \pi_{kml}^{\boldsymbol{1}\{\gamma_k(i,j) = l\}} \right)^{1-\delta_{k}(i,j)}\right\}} \end{split} \end{equation*}\]

We need to compute the \(\lambda\) and the \(\pi_{kml}\) parameters, given by:

\[\begin{equation*} \begin{split} \lambda &= \frac{1}{N_A N_B}\sum_{i =1}^{N_A}\sum_{j=1}^{N_B} \xi_{ij} \\ \pi_{kml} &= \frac{\sum_{i=1}^{N_A}\sum_{j=1}^{N_B}\boldsymbol{1}\{\gamma_k(i,j) = l\}(1-\delta_k(i,j))\xi_{ij}^m(1-\xi_{ij})^{1-m}}{\sum_{i=1}^{N_A}\sum_{j=1}^{N_B}(1-\delta_k(i,j))\xi_{ij}^m(1-\xi_{ij})^{1-m}} \end{split} \end{equation*}\]

Iterate until convergence within tolerance.
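A toy version of this E-M loop for \(K\) binary agreement fields (two agreement levels, no missingness, synthetic data; the full model handles multiple levels per field and \(\delta\)):

```python
import random

def fs_em(gammas, K, iters=100):
    """Toy Fellegi-Sunter EM. gammas: list of K-tuples of 0/1 agreement indicators."""
    lam, m, u = 0.1, [0.9] * K, [0.1] * K  # init: P(agree|match) high, P(agree|non-match) low
    for _ in range(iters):
        # E-step: posterior match probability xi for each pair
        xis = []
        for g in gammas:
            pm, pu = lam, 1 - lam
            for k in range(K):
                pm *= m[k] if g[k] else 1 - m[k]
                pu *= u[k] if g[k] else 1 - u[k]
            xis.append(pm / (pm + pu))
        # M-step: update lambda and the per-field agreement probabilities
        lam = sum(xis) / len(xis)
        for k in range(K):
            m[k] = sum(x * g[k] for x, g in zip(xis, gammas)) / sum(xis)
            u[k] = sum((1 - x) * g[k] for x, g in zip(xis, gammas)) / (len(xis) - sum(xis))
    return lam, m, u

# synthetic pairs: ~25% true matches; fields agree w.p. 0.95 (matches) / 0.2 (non-matches)
random.seed(1)
gammas = []
for _ in range(2000):
    p = 0.95 if random.random() < 0.25 else 0.2
    gammas.append(tuple(int(random.random() < p) for _ in range(3)))

lam_hat, m_hat, u_hat = fs_em(gammas, K=3)
# lam_hat should land near 0.25, m_hat near 0.95, u_hat near 0.2
```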

Estimating global model accuracy



False positive rate (1 - precision), for some posterior match threshold \(S\):

\[\mathbb{P}(M_{ij} = 0 | \xi_{ij} \geq S) = \frac{\sum_{i=1}^{N_A}\sum_{j=1}^{N_B}\boldsymbol{1}\{\xi_{ij} \geq S \}(1-\xi_{ij})}{\sum_{i=1}^{N_A}\sum_{j=1}^{N_B}\boldsymbol{1}\{\xi_{ij} \geq S \}}\]

False negative rate (1 - recall):

\[\mathbb{P}(\xi_{ij} < S | M_{ij} = 1) = \frac{\sum_{i=1}^{N_A}\sum_{j=1}^{N_B}\boldsymbol{1}\{\xi_{ij} < S \}\,\xi_{ij}}{\lambda N_A N_B}\]

Handling Fellegi-Sunter outputs



F-S proposed three “regions” of the observed distribution of the posterior probabilities:

  • Discard region \([0, S_1)\)
  • Manual inspection (!) region \([S_1, S_2)\)
    • Should be really small! Distribution is VERY bimodal.
  • Match region \([S_2, 1)\)

Most people just apply a threshold (the default is 0.85) and move on.

Recommendation: investigate some pairs in each observed \(\boldsymbol{\gamma}\) combination near some \(S_1\) and apply post-hoc filtering. (Recall household overclustering problem.)

Fellegi-Sunter implementations

Demo time ▶️

Custom comparison: blending ZIP and city



place_comp = CustomComparison(
  output_column_name = "place",
  comparison_description = "Combo place (ZIP, city)",
  comparison_levels = [
    cll.CustomLevel(
      "(a_zip_l IS NULL OR a_zip_r IS NULL) AND (a_city_l IS NULL OR a_city_r IS NULL)"
      ).configure(is_null_level=True),

    cll.ExactMatchLevel("a_zip", label_for_charts="Exact (zip)"
      ).configure(term_frequency_adjustments=True),

    cll.ExactMatchLevel("a_city", label_for_charts="Exact (city)"
      ).configure(term_frequency_adjustments=True),

    cll.JaroLevel("a_city", distance_threshold=0.9),

    cll.ElseLevel()
])

Custom comparisons: motivations

Key ideas:

  • Helps to avoid violations of the conditional independence assumption
    • Great for columns that reflect the same info but don’t share the same structure
    • The same idea works for street addresses vs. PO boxes
  • Levels are evaluated in order — first match wins
  • term_frequency_adjustments downweights common values (e.g., “NEW YORK” vs. rare ZIP codes)

3) In practice

(Non-human) entity name matching



This is a miserable problem, usually compounded by a lack of blocking variables.

“J.P. Morgan Chase” vs. “JPMORGON” vs. “JPM”

  • "JP MORGAN" and "CHASE BANK" are the same company now, but were not before 2000.

  • “FACEBOOK” vs. “META PLATFORMS, INC.”

There is a string similarity problem AND an information problem.

  • (“Lexical similarity” vs. “semantic similarity”)

If we could create an \(n\)-dimensional representation of the string and incorporate outside information…


Libgober and Jerzak (2024, PSRM): LinkOrgs software and paper

Kaufman and Klevs (2022, Political Analysis): Adaptive Fuzzy String Matching

Your research

Finally, the frontier

LLM-Enhanced Record Linkage

Blocking Without Exact Matching:
Locality-Sensitive Hashing

Based on Leskovec, Rajaraman, and Ullman, Mining of Massive Datasets, Ch. 3

LSH: motivation



We want: similar strings → similar hashes (buckets), in ~\(O(N)\) time.

Recall that

\[J(C_1, C_2) = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}\]

The pipeline:

  1. Shingling → represent each string as a set
  2. Min-hashing → compress sets into short signatures
  3. LSH (banding) → generate candidate pairs from signatures

This gives us a way to block without exact matching — we can find candidate pairs that are approximately similar, without comparing every pair.

Shingling (k-shingles)



“k-shingles” are n-grams of characters (or words). Here, character-level:

JOHN as 2-grams: {JO, OH, HN}

Now represent every string by its k-shingle set:

\(\text{Jaccard}(\texttt{JOHN}, \texttt{JOHNN}) = \frac{|\{\texttt{JO, OH, HN}\}|}{|\{\texttt{JO, OH, HN, NN}\}|} = \frac{3}{4}\)

  • For short strings, pick small \(k \approx 5\)
  • For longer strings, use larger \(k\) so you don’t find all shingles in every string
  • Optional: hash shingles again to improve storage — e.g., {JO, OH, HN} → {2, 1, 7}

Shingling: the problem



One-hot encode each string by its character shingles:

     JOHN   JOHNN
J    1      1
O    1      1
H    1      1
N    1      1
E    0      0

With many unique shingles, this matrix is very sparse and requires a lot of memory to store.

Still \(\frac{N(N-1)}{2}\) pairwise comparisons. Can’t do this on one-hot encoded vectors at scale.

Min-hashing: idea



We want to find similar columns while computing small signatures.

Idea: “hash” each column (i.e., each string’s one-hot encoding) so that the hashes all fit in memory.

Key property: for Jaccard similarity, we can find a hash \(h\) such that

\[\Pr(h(C_1) = h(C_2)) = J(C_1, C_2)\]

This hash function is called the min-hash.

Min-hashing: how it works



Randomly permute the rows of the one-hot matrix. For each column (string), the min-hash is the index of the first (“minimum”) row with a 1.

Initial shingles-by-docs one-hot matrix:

      \(S_1\)   \(S_2\)   \(S_3\)   \(S_4\)
a     1         0         0         1
b     0         0         1         0
c     0         1         0         1
d     1         0         1         1
e     0         0         1         0

Three random permutations of the rows:

      \(P_1\)   \(P_2\)   \(P_3\)
a     4         5         1
b     5         1         2
c     2         4         5
d     1         3         4
e     3         2         3

Yields a signature matrix:

           \(S_1\)   \(S_2\)   \(S_3\)   \(S_4\)
\(P_1\)    1         2         1         1
\(P_2\)    3         4         1         3
\(P_3\)    1         5         2         1

Min-hashing: why it works



For a random permutation \(P_i\), \(\Pr(P_i(S_1) = P_i(S_2)) = J(S_1, S_2)\)

Why? For any random row, the pair \((S_1, S_2)\) is one of:

Row values Outcome
\((1, 1)\) count as \(X\)
\((0, 1)\) or \((1, 0)\) count as \(Y\)
\((0, 0)\) skip, move to next row

\[\Pr(P_i(S_1) = P_i(S_2)) = \frac{X}{X + Y}\]

which is exactly the Jaccard similarity.

So: similarity of \(S_1\) and \(S_2\) \(\approx\) fraction of hash functions in which their signatures agree.

As \(N_{\text{hashes}} \to \infty\), this converges to Jaccard.

Min-hashing: practical tweaks



Now we have a much denser representation of our matrix. (FYI, 100 permutations is common.)

But permutations and storage are both quite memory-intensive.

Tweak 1: Instead of \(N\) actual permutations, use \(N\) hash functions that map row numbers to buckets.

Universal hashing: \(h(x) = ((ax + b) \bmod p) \bmod N\)

where \(x\) is the row number, \(a, b\) are random integers, and \(p\) is a prime \(> N\).

Example: \(h_1 = (x + 1) \bmod 5\), \(\quad h_2 = (3x + 1) \bmod 5\)

As long as the number of buckets is large enough to avoid too many collisions, this approximates random row permutations without actually permuting.

We can work string by string without storing the full one-hot matrix.
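Tweak 1 in code: a minhash signature built from universal hash functions over shingle hashes (CRC32 here stands in for any stable hash of a shingle):

```python
import random, zlib

P = 2_147_483_647  # prime modulus for universal hashing

def shingles(s, k=2):
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def minhash_sig(shingle_set, coeffs):
    """One min per hash function h(x) = (a*x + b) mod P, over stable shingle hashes."""
    xs = [zlib.crc32(g.encode()) for g in shingle_set]
    return [min((a * x + b) % P for x in xs) for a, b in coeffs]

def sig_similarity(s1, s2):
    """Fraction of hash functions on which the signatures agree ~ Jaccard."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)

random.seed(42)
coeffs = [(random.randrange(1, P), random.randrange(P)) for _ in range(500)]

A, B = shingles("JOHN SMITH"), shingles("JON SMITH")
true_j = len(A & B) / len(A | B)   # 7 shared bigrams / 10 total = 0.7
est_j = sig_similarity(minhash_sig(A, coeffs), minhash_sig(B, coeffs))
# est_j should land within a few points of 0.7
```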

Computing the signature matrix



For each string \(S_i\): for each row where \(S_i = 1\), update \(\text{sig}(h, S_i) \gets \min(\text{current}, h(\text{row}))\).

         \(S_1\)   \(S_2\)   \(S_3\)   \(S_4\)   \(h_1\)   \(h_2\)
row 1    1         0         0         1         1         1
row 2    0         0         1         0         2         4
row 3    0         1         0         1         3         2
row 4    1         0         1         1         4         0
row 5    0         0         1         0         0         3

Signature matrix (for each string, the minimum hash value over its 1-rows):

           \(S_1\)   \(S_2\)   \(S_3\)   \(S_4\)
\(h_1\)    1         3         0         1
\(h_2\)    0         2         0         0

Min-hashing: Tweak 2



Problem: Still scanning \(N_{\text{shingles}}\) rows per hash function.

Tweak 2: Only examine \(M\) rows per hash function, where \(M \ll N_{\text{shingles}}\). This gives a \(\sim N_{\text{shingles}}/M\) reduction in compute.

This is an approximation. For two random strings \(S_i, S_j\):

  1. If both \(= 0\) \(\forall\) \(M\) rows → we learn nothing from this function
  2. If only one \(= 0\) \(\forall\) \(M\) rows → we know they’re quite different (only one gets updated)

As long as case (1) is rare, this isn’t a serious problem — and we can spend the saved time computing more hash functions.

The signature matrix: now what?



We have an \(N_{\text{hashes}} \times N_{\text{obs}}\) signature matrix. Say \(250 \times 1\text{M}\): that’s ~1 GB.

But still \(\binom{1\text{M}}{2} \approx 500\text{B}\) comparisons!

We could hash each signature to some value (and then block on that), but we’d only catch items with exactly the same signatures.

We need a way to find approximately matching signatures…

Locality-sensitive hashing (banding)



Divide the signature matrix into \(b\) bands of \(r\) rows, where \(b \cdot r = N_{\text{hashes}}\).

(figure: the signature matrix divided into \(b\) bands of \(r\) rows each)

“Candidate” pairs: strings that have the same signature in at least one band.

How to find them? Hash each band’s signature to a bucket. Look at buckets with more than one item.

Locality-sensitive hashing (banding): example



Example: 100,000 strings, 100 signatures → ~40 MB

With \(b = 20, r = 5\), find pairs with Jaccard \(\geq 0.8\):

\[\Pr(C_1, C_2 \text{ match in one band}) = (0.8)^5 \approx 0.328\]

\[\Pr(\text{not similar in ANY band}) = (1 - (0.8)^5)^{20} \approx 0.00035\]

We miss ~1 in 3,000 true pairs.
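These banding probabilities are one line of arithmetic; as a function:

```python
def p_candidate(s: float, r: int, b: int) -> float:
    """Probability that a pair with Jaccard similarity s shares a bucket in
    at least one of b bands of r rows: 1 - (1 - s^r)^b."""
    return 1 - (1 - s ** r) ** b

p_candidate(0.8, r=5, b=20)   # ~0.99964: we miss ~1 in 3,000 true pairs
p_candidate(0.3, r=5, b=20)   # ~0.047: most dissimilar pairs are skipped
```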

Tuning \(b\) and \(r\): the S-curve



The probability of becoming a candidate pair at Jaccard similarity \(s\) is:

\[1 - (1 - s^r)^b\]

This is an S-curve.

  • Above the threshold: false negatives (pairs we missed)
  • Below the threshold: false positives (non-matches flagged as candidates)


Didn’t cover: Euclidean LSH for vectors

Introduction to zoomerjoin in R



Great package for visualizing and implementing LSH.

A (0.1, 0.7, 0.001, 0.99)-sensitive LSH means that strings with similarity less than 0.1 will have a 0.1% chance of being compared, while strings with 0.7 similarity have a 99% chance of being compared.

library(zoomerjoin)

jaccard_hyper_grid_search(s1 = 0.1, s2 = 0.7, p1 = 0.001, p2 = 0.99)

## band_width    n_bands
##          5         26

Visualizing LSH with zoomerjoin



jaccard_curve(band_width = 5, n_bands = 26)

Creating blocks with zoomerjoin



x = read_csv('name_counts.csv') ## SSA first names, 1950-2006

x$cluster = jaccard_string_group(
  string        = x$first,
  n_gram_width  = 2,        # 2-shingles (bigrams)
  n_bands       = 26,       # b = 26 bands
  band_width    = 5,        # r = 5 rows per band
  threshold     = 0.7,      # Jaccard similarity threshold for "match"
  progress      = TRUE
)

Block diagnostics with zoomerjoin (1/2)



## largest clusters
summarise(x, n = sum(N), .by = 'cluster') %>% arrange(-n)
   cluster          n
 1 LINDA     16451074
 2 SANDRA     8000394
 3 KAREN      7973116
 4 PAMELA     7548071
 5 HELEN      5752257
 6 MARY       5179872

Block diagnostics with zoomerjoin (2/2)



filter(x, cluster == 'LINDA') # inspect
   first   gender      N cluster
 1 LINDA   F      876697 LINDA  
 2 DIANE   F      368332 LINDA  
 3 JANE    F      154558 LINDA  
 4 DIANA   F      271258 LINDA  
 5 ANNE    F      168358 LINDA  
 6 DIANNE  F       58080 LINDA  
 7 ANNA    F      324602 LINDA  
 8 ANNIE   F       62001 LINDA  
 9 JANIS   F       27071 LINDA  
10 LILLIAN F       78254 LINDA  

Joining datasets with zoomerjoin



jaccard_inner_join(
  dfA, dfB,
  by                 = 'col1',
  block_by           = 'col2',
  n_gram_width       = 2,
  n_bands            = 50,
  band_width         = 8,
  threshold          = 0.7
)

lsh, the DuckDB version


Syntax: lsh_min(string, ngram_width, band_count, band_size, seed) (docs)


INSTALL lsh FROM community;
LOAD lsh;

CREATE OR REPLACE TEMPORARY TABLE temp_names (
    name_a VARCHAR,
    name_b VARCHAR
);

INSERT INTO temp_names (name_a, name_b) VALUES
    ('Charlotte Brown', 'Charlene Browning'),
    (NULL, 'Davis Martin'),
    ('Olivia Thomas', 'Olive Thomason'),
    ('Alice Johnson', NULL);

SELECT lsh_min(name_a, 2, 3, 2, 123) AS hash FROM temp_names;

How to block with lsh



Storing these vectors is very memory-intensive. Instead,

SELECT A.ind, B.id FROM A INNER JOIN B
ON lsh_min(A.col, 2, 1, 3, 1)[1] = lsh_min(B.col, 2, 1, 3, 1)[1]
WHERE [...]

UNION

SELECT A.ind, B.id FROM A INNER JOIN B
ON lsh_min(A.col, 2, 1, 3, 2)[1] = lsh_min(B.col, 2, 1, 3, 2)[1]
WHERE [...]

[...]

SQL is sometimes the “lower level” language.

Next steps



The record linkage/entity resolution literature is enormous. There are many supervised and unsupervised methods, some of which merge some of the steps in our framework.